Ford GoBike (also known as Bay Wheels system) is a regional public bicycle sharing system in California's San Francisco Bay Area and was introduced in 2013 as a pilot program for the region, with 700 bikes and 70 stations across San Francisco and San Jose.
Ford GoBike consists of a fleet of specially designed, environment-friendly and durable bikes that are locked into a network of docking stations throughout the city. The bikes can be unlocked from one station and returned to any other station in the system, making them ideal for one-way trips.
People use bike share to commute to work or school, run errands, get to appointments or social engagements and more. It's a fun, convenient and affordable way to get around.
The bikes are available for use 24 hours/day, 7 days/week, 365 days/year and riders have access to all bikes in the network when they become a member or purchase a pass.
In June 2017, the system was officially re-launched as Ford GoBike in a partnership with Ford Motor Company. After Motivate's acquisition by Lyft, the system was renamed Bay Wheels in June 2019. The system is expected to expand to 7,000 bicycles around 540 stations in San Francisco, Oakland, Berkeley, Emeryville, and San Jose.
In this investigation of the Ford GoBike system, I explore the most influential customer behaviors and characteristics, such as user type, gender, and age, as well as ride features such as duration, timing, distance, stations, and whether the whole ride used a GoBike bike or not. Finally, I investigate how these attributes affect usage of the Bay Wheels system.
The data consists of 183,412 rows and 16 attributes, one row per bike ride. The attributes include ride statistics (ride duration, ride start and end time, and station information and coordinates), user information (user type, birth day, and gender), and additional features such as bike id and bike share. The 17,318 missing data points were imputed rather than removed, because the rows that contain these NaN values are still used in analyzing and plotting other attributes of interest.
After wrangling, the dataset has 27 columns instead of 16, as 11 attributes were extracted to facilitate the exploratory analysis and visualization.
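A minimal sketch of how such derived attributes could be extracted with pandas. The column names `duration_sec` and `start_time`, the derived column names, and the sample rows are assumptions made for illustration; the source does not list the 11 extracted attributes.

```python
import pandas as pd


def add_time_features(rides: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `rides` with derived duration and start-time columns.

    Assumes (hypothetically) a numeric `duration_sec` column and a
    `start_time` column that pandas can parse as a timestamp.
    """
    out = rides.copy()
    out["start_time"] = pd.to_datetime(out["start_time"])
    # Durations in coarser units, derived from seconds.
    out["duration_min"] = out["duration_sec"] / 60
    out["duration_hour"] = out["duration_sec"] / 3600
    # Calendar parts of the start timestamp.
    out["start_hour"] = out["start_time"].dt.hour
    out["start_day"] = out["start_time"].dt.day
    out["start_weekday"] = out["start_time"].dt.day_name()
    return out


# Made-up sample rows, not taken from the dataset.
sample = pd.DataFrame(
    {
        "duration_sec": [726, 3600],
        "start_time": ["2019-02-28 17:32:10", "2019-02-01 08:05:00"],
    }
)
features = add_time_features(sample)
print(features[["duration_min", "start_hour", "start_day", "start_weekday"]])
```

The same pattern would apply to the end-time columns.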
The main features I am most interested in are ride durations, ride times, ride stations and distance, and user types and characteristics. My investigation primarily explores the patterns and correlations in these features and how user behaviors and characteristics relate to usage of the GoBike system. I will explore the dataset with the goal of answering these research questions: When are most trips taken in terms of time of day, day of the week, or month of the year? How long does the average trip take? Does the above depend on whether a user is a subscriber or a customer?
The supportive features that can facilitate the investigation and exploration of the features of interest include:
In this exploration, I will investigate and plot the data distribution of the most influential variables in the dataset individually, one feature at a time.
- Variable Type: Quantitative/Numeric
- Appropriate Plot: Histogram
- Duration in seconds is highly right-skewed, with values clustered at the left in a log-normal-like distribution; more than 175k rides fall below 10k seconds.
- Changing the x-axis limit shows that the data is clustered in position between 7.5k-10k seconds.
- Accordingly, I decided to transform the x-axis to explore the data distribution in depth as follows.
- Transforming the x-axis shows that the data is approximately normally distributed, with slight skewness to the right and a mean of 726 seconds.
- Most rides have a duration between 300-1000 seconds, and 95% of rides have a duration below 1571 seconds.
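The axis transformation described above can be sketched as follows, assuming it is a base-10 log scale (the source does not name the transform). The helper `log_bin_edges` and the sample durations are hypothetical.

```python
import numpy as np


def log_bin_edges(values, step=0.1):
    """Histogram bin edges evenly spaced in log10 units (assumed transform)."""
    values = np.asarray(values, dtype=float)
    lo = np.floor(np.log10(values.min()))
    hi = np.log10(values.max()) + step
    return 10 ** np.arange(lo, hi, step)


# Made-up durations in seconds, including one very long ride.
durations = np.array([61, 300, 500, 726, 1000, 1571, 85444], dtype=float)

edges = log_bin_edges(durations)
counts, _ = np.histogram(durations, bins=edges)
p95 = np.percentile(durations, 95)  # the kind of "95% of rides" cut-off quoted above

print(len(edges), counts.sum(), p95)
```

With matplotlib, the same edges could be passed as `bins=edges` to `plt.hist` together with `plt.xscale("log")`, so that the long right tail no longer compresses the bulk of the rides into a single bar.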
- Variable Type: Quantitative/Numeric
- Appropriate Plot: Histogram
- Duration in minutes follows the same distribution as duration in seconds, since it was derived from it: a log-normal-like shape with skewness to the right. Most rides (175k) last less than 150 minutes.
- Rescaling the x-axis showed that most of the data is still clustered in the same position, below 150 minutes.
- Again, I decided to transform the x-axis to explore the data distribution in depth as follows.
- Transforming the x-axis shows that the data is approximately normally distributed with a mean of 12 minutes.
- 95% of rides have a duration below 26 minutes.
- Variable Type: Quantitative/Numeric
- Appropriate Plot: Histogram
- Duration in hours follows the same distribution as duration in seconds, since it was derived from it: a log-normal-like shape with skewness to the right. Most rides (175k) last less than 2.5 hours.
- Rescaling the x-axis showed that the data is still clustered in the same position, below 2.5 hours.
- Similarly, I decided to transform the x-axis to explore the data distribution in depth as follows.
- Transforming the x-axis shows that the data is highly skewed to the right with a mean of 0.2 hours.
- 95% of rides have a duration below 3.3 hours.
- Variable Type: Quantitative/Numeric
- Appropriate Plot: Histogram
- Start time hour and end time hour follow the same bimodal distribution with a very small skew to the left.
- Most rides (50k) start at 8-9 AM in the first peak and at 2-6 PM in the second peak (30k-40k rides). Similarly, most rides (50k) end at 8-9 AM in the first peak and at 2-6 PM in the second peak (30k-40k rides).
- I decided to transform the x-axis to explore the data distribution in depth as follows.
Insights:
- Transforming start and end hours shows that they follow the same bimodal distribution skewed to the left.
- Most rides (15k-25k) start at 8-11 AM in the first peak and at 1-9 PM in the second peak (15k-35k rides). Similarly, most rides (20k) end at 8-11 AM in the first peak and at 1-9 PM in the second peak (15k-35k rides). 50% of rides start before 2 PM and the mean start hour is 1.45 PM; 50% of rides end before 2 PM and the mean end hour is 1.6 PM.
- Variable Type: Quantitative/Numeric
- Appropriate Plot: Bar Chart and Histogram
- Start time days with a higher number of rides are 5-7, 11-12, 14-15, 19-22, 25, and 27-28, averaging 8k-10k rides/day.
- Similarly, end time days with a higher number of rides are 5-7, 11-12, 14-15, 19-22, 25, and 27-28, averaging 8k-10k rides/day.
- Sorting the start and end time days horizontally confirms the notes above:
- Start time days with a higher number of rides are 5-7, 11-12, 14-15, 19-22, and 27-28, averaging 8k-10k rides/day.
- Similarly, end time days with a higher number of rides are 5-7, 11-12, 14-15, 19-22, 25, and 27-28, averaging 8k-10k rides/day.
- Moreover, days with a lower number of rides are 2-3, 9, and 13.
- I decided to explore how start and end time days are distributed to gain more insights.
- Start time days and end time days follow the same bimodal distribution with skewness to the left.
- Most start time rides occur on days 4-8, 13-15, 20-22, and 25-28. Similarly, most end time rides occur on days 4-8, 13-15, 20-22.
- I decided to transform the x-axis to explore the data distribution in depth as follows.
- Transforming start and end time days shows that they follow the same bimodal distribution skewed to the left.
- Most rides start on days 4-7 in the first peak, 10-13 in the second peak, and 19-28 in the third peak. Similarly, most rides end on days 4-7 in the first peak, 10-13 in the second peak, and 19-28 in the third peak.
- The mean number of start and end time rides per day is 6,550.
- 50% of start and end time rides occurred before day 15.
- Variable Type: Qualitative/Categorical
- Appropriate Plot: Bar Chart and Pie Chart/Waffle Plot
- Both start and end time weekdays follow the same distribution.
- Thursday has the highest number of rides (35k), while Saturday and Sunday have the lowest (15k).
- Working weekdays (26k-35k) have more rides than weekend days (15k).
- It seems that riders are commuters, rather than people who use bikes for fun or sports.
- Both start and end time weekdays show the same proportions in the pie charts.
- Thursday has the highest proportion of rides (19%), while Saturday and Sunday have the lowest (8%/day).
- Working weekdays (15%-19%/day) have a higher proportion of rides than weekend days (8%/day).
- The average number of rides per weekday is 26,201 for both start and end times, and the 95th percentile is 34,181 rides for start time weekday and 34,175 for end time weekday.
- Again, it seems that riders are commuters, rather than people who use bikes for fun or sports.
- Variable Type: Qualitative/Categorical
- Appropriate Plot: Bar Chart and Pie Chart
- All rides start in February (183,412) and most of them end in February as well (183,396), except for 16 rides that ended in March.
- 99.99% of rides end in February and 0.01% end in March.
- Both start and end stations are skewed to the right, with most stations around the mean of ~556 rides/station.
- The number of rides handled by end stations (4857 rides) is greater than that of start stations (3904 rides).
- To better understand the distribution of start and end stations, I will explore how stations are clustered and how they are performing in each city. First I will visualize their clusters by city; then I will plot each city and its stations.
- From top left to bottom right of the plots above, the 330 stations are clustered in San Jose, Oakland_Berkeley, and San Francisco.
- Based on the longitude coordinate, San Jose is located at longitude > -122.1, Oakland_Berkeley at longitude > -122.35 & < -122.1, and San Francisco at longitude > -122.5 & < -122.35.
- Accordingly, we can filter stations by longitude and create subsets of the data to mask and visualize the clusters of stations by city. I will use 'start_station_longitude' to create the masks, as the start and end coordinates of stations are roughly the same.
- San Francisco's statistics are greater than those of both Oakland_Berkeley and San Jose.
- San Francisco has 156 stations, Oakland_Berkeley has 127 stations, and San Jose has 47 stations.
- Avg. number of rides per day in San Francisco is 4,774, in Oakland_Berkeley is 1,480, and in San Jose is 295.
- Avg. number of rides per station in San Francisco is , in Oakland_Berkeley is 1,480, and in San Jose is 295.
- Avg. number of rides per station/day in San Francisco is 31, in Oakland_Berkeley is 12, and in San Jose is 6.
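The longitude masks can be sketched as below, using cut-offs of -122.1 and -122.35 and reading the San Francisco band as -122.5 to -122.35. The helper name `assign_city`, the "Unknown" fallback label, and the sample coordinates are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def assign_city(longitude: pd.Series) -> pd.Series:
    """Label each station longitude with a city cluster."""
    conditions = [
        longitude > -122.1,                              # San Jose
        (longitude > -122.35) & (longitude <= -122.1),   # Oakland / Berkeley
        (longitude > -122.5) & (longitude <= -122.35),   # San Francisco
    ]
    labels = ["San Jose", "Oakland_Berkeley", "San Francisco"]
    # Anything outside the three bands gets a hypothetical "Unknown" label.
    city = np.select(conditions, labels, default="Unknown")
    return pd.Series(city, index=longitude.index, name="city")


# Made-up longitudes, one per cluster.
stations = pd.DataFrame({"start_station_longitude": [-121.89, -122.27, -122.40]})
stations["city"] = assign_city(stations["start_station_longitude"])
print(stations)
```

Once each ride carries a `city` label, per-city counts such as stations per city or rides per day follow from a `groupby("city")`.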
- Variable Type: Quantitative/Numeric
- Appropriate Plot: Histogram
- Ride distance distribution is highly skewed to the right in a log-normal-like shape, as >175k of the rides have a distance below ~8km.
- I tried changing the scale limit of the distance to better visualize the data. However, the data is still highly clustered in the same position of ~7km, with some outliers.
- Finally, I transformed the x-axis to learn more about the distance distribution as shown below.
- Now we have a bimodal distribution with two peaks: one from 1-1.5km and one from 1.5-2.6km.
- Avg. distance/ride is 1.7km and 95% of rides have a distance below 3.8km.
- For each distance level in the 1-2.6km range, there are > 6k rides/level.
- There are 9,170 rides that have a distance above 3.8km.
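The source does not state how ride distance was derived from the station coordinates. One common choice, assumed here purely for illustration, is the haversine great-circle distance between the start and end stations; the function name and the Earth radius constant are not from the original analysis.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of latitude along a meridian, as a sanity check.
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
print(round(one_degree, 2))
```

A straight-line distance like this underestimates the path actually ridden, and it is zero for round trips that return to the starting station.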
- Variable Type: Quantitative/Numeric
- Appropriate Plot: Histogram
- The total number of unique bike ids is 4,646.
- The average number of rides per bike id is 39.
- The max number of rides per bike id is 191 and the min is 1.
- 90% of bike ids have fewer than 100 rides and 10% of bike ids have fewer than 6 rides.
- The top 10% comprises 465 bike ids and the bottom 10% comprises 565 bike ids.
- The distribution of bike ids is skewed to the right, with most bike ids clustered around the 4k-6k range of most-used ids.
- Variable Type: Qualitative/Categorical
- Appropriate Plot: Bar Chart and Pie Chart/Waffle Plot
- Total number of bikes that weren't shared during the trip is 166,053 and represents 91% of total rides.
- Total number of bikes that were shared during the trip is 17,359 and represents 9% of total rides.
- Variable Type: Qualitative/Categorical
- Appropriate Plot: Bar Chart and Pie Chart/Waffle Plot
- Total number of subscribers is 163,544 and represents 89% of total users.
- Total number of customers is 19,868 and represents 11% of total users.
- Variable Type: Qualitative/Categorical
- Appropriate Plot: Bar Chart and Pie Chart/Waffle Plot
- Total number of male users is 130,651 and represents 71% of total users.
- Total number of female users is 40,844 and represents 22% of total users.
- Total number of nondefined users is 8,265 and represents 5% of total users.
- Total number of other users is 3,652 and represents 2% of total users.
- Variable Type: Quantitative/Numeric
- Appropriate Plot: Histogram
- The member age distribution is skewed to the right. Scaling the x-axis didn't change the skew, so transforming the x-axis may bring the data closer to a normal distribution as follows.
- The average member age is 32 years; 75% of ages are under 38 years and 99% are under 63 years.
- Member ages are skewed to the right, with most ages clustered between 18-40 years in both the non-scaled and scaled distributions.
- However, transforming the years' scale showed that most ages are clustered between 23-40 years old.
To get a better understanding of the age distribution, I will remove the age outliers and the 0 age value, which was previously used to fill missing values in the 'member_age' attribute, as follows.
- 99% of ages are below 63 years old, a total of 181,598 records.
- 1% of ages are above 63 years old, a total of 1,814 records.
- Records with an age of 0 or below 18 years old total 8,265.
- The net number of records with ages from 18-63 years old is 173,333.
- Both scaled and non-scaled age distributions showed that the data is skewed to the right, with most ages clustered between 23-39 years old.
- Transforming the age scale confirmed very close findings: ages are clustered between 23-41 years old with 2 peaks, one from 23-31 and the other from 32-41 years old.
- This led me to create member age groupings to compare the distribution of data among these age groups.
- The 20-30 age group has the highest share of rides (40%), followed by the 30-40 group (36.5%).
- The lowest age group is 70-141 years old with a proportion of 0.3%.
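A minimal sketch of the age grouping with `pandas.cut`. The exact bin edges, the left-closed intervals, and the sample ages are assumptions; the source only names groups such as 20-30, 30-40, and 70-141.

```python
import pandas as pd

# Assumed edges: left-closed bins [0, 20), [20, 30), ..., [70, 142) so that age 141 is included.
AGE_EDGES = [0, 20, 30, 40, 50, 60, 70, 142]
AGE_LABELS = ["0-20", "20-30", "30-40", "40-50", "50-60", "60-70", "70-141"]


def age_group_share(ages: pd.Series) -> pd.Series:
    """Percentage of rides falling in each age group."""
    groups = pd.cut(ages, bins=AGE_EDGES, labels=AGE_LABELS, right=False)
    return groups.value_counts(normalize=True).mul(100).round(1)


# Made-up ages, not taken from the dataset.
ages = pd.Series([19, 25, 27, 29, 33, 35, 38, 45, 62, 141])
share = age_group_share(ages)
print(share.sort_index())
```

Because the result is indexed by an ordered categorical, `sort_index()` lists the groups from youngest to oldest, which is the order a bar chart would normally use.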
I have found that the following variables are skewed with a log-normal distribution and need x-axis or y-axis transformation:
I have applied transformations to several skewed, log-normal x-axis or y-axis scales in order to explore the data distributions in more depth. I can summarize these transformations as follows:
In this section, I will investigate and plot the correlations, patterns, trends, models, and relationships between a couple of variables in the dataset, two features at a time.
We can plot the following 9 categorical relationships for the major categorical variables of user type, member gender, bike share status, and ride start and end stations:
- User Type vs. Member Gender & User Type vs. Bike Share (2 plots)
- User Type vs. Age Groups (1 plot)
- User Type vs. Ride Start and End Stations (2 plots)
- Member Gender vs. Bike Share (1 plot)
- Member Gender vs. Age Groups (1 plot)
- Member Gender vs. Ride Stations (2 plots)
- Bike Share vs. Age Groups (1 plot)
- Bike Share vs. Ride Stations (2 plots)
- Most subscribers are male (73% of total subscribers), followed by females with 22%.
- Most customers are male (58% of total customers), followed by females with 23%.
- 89.4% of subscribers don't share bikes during their rides and 10.6% do.
- No customers share bikes during their rides.
- For subscribers, age group 20-30 is the highest (39.7%), followed by age group 30-40 (39.4%), and the lowest age group is 70-141 (0.001%).
- For customers, age group 20-30 is the highest (40%), followed by age group 30-40 (36.3%), and the lowest age group is 70-141 (0.004%).
- Subscribers use both start and end stations more than customers.
- 90% of males don't share bikes during the ride and 10% share bikes.
- 91% of females don't share bikes during the ride and 9% share bikes.
- No not-defined users share bikes during the ride.
- 82% of other users don't share bikes during the ride and 18% share bikes.
- For the male gender, age group 20-30 is the highest (39%), followed by age group 30-40 (36.3%), and the lowest age group is 70-141 (0.004%).
- For the female gender, age group 20-30 is the highest (44%), followed by age group 30-40 (36%), and the lowest age group is 70-141 (0.002%).
- For the other gender, age group 30-40 is the highest (43.3%), followed by age group 20-30 (33%), and the lowest age group is 70-141 (0.0025%).
- Males use both start and end stations more than other genders.
- For non-shared rides, age group 30-40 is the highest (39%), followed by age group 20-30 (38%), and the lowest age group is 70-141 (0.0024%).
- For shared rides, age group 20-30 is the highest (58%), followed by age group 30-40 (13%), and the lowest age group is 70-141 (0.01%).
- Non-shared rides outnumber shared rides at both start and end stations.
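Within-group percentages like those above can be obtained from a row-normalized cross-tabulation. This is a sketch on made-up data; the column names `user_type` and `member_gender` and the helper `row_percentages` are assumptions.

```python
import pandas as pd


def row_percentages(df: pd.DataFrame, row: str, col: str) -> pd.DataFrame:
    """Cross-tabulate two categorical columns; each row sums to 100%."""
    return pd.crosstab(df[row], df[col], normalize="index").mul(100).round(1)


# Made-up rides, not taken from the dataset.
rides = pd.DataFrame(
    {
        "user_type": ["Subscriber"] * 4 + ["Customer"] * 2,
        "member_gender": ["Male", "Male", "Male", "Female", "Male", "Female"],
    }
)
table = row_percentages(rides, "user_type", "member_gender")
print(table)
```

Normalizing by row answers "what share of subscribers are male"; passing `normalize="columns"` instead would answer "what share of male riders are subscribers", which is a different question.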
We can plot many numeric variable relationships for the major numeric variables of member age, ride start and end times, ride durations, and ride distances as follows:
- Member Age vs. Ride Duration in Seconds, Minutes, Hours (3 Plots)
- Member Age vs. Ride Start Time Hour and End Time Hour (2 Plots)
- Member Age vs. Ride Start Time Day and End Time Day (2 Plots)
- Member Age vs. Ride Distance (1 Plot)
- Member Age vs. Bike Id (1 Plot)
- Ride Start and End Time Hour vs. Ride Duration in Seconds, Minutes, and Hours (6 Plots)
- Ride Start and End Time Day vs. Ride Duration in Seconds, Minutes, and Hours (6 Plots)
- Ride Start and End Time Hour vs. Ride distance (2 Plots)
- Ride Start and End Time Day vs. Ride distance (2 Plots)
- Ride Start and End Time Hour vs. Bike Id (2 Plots)
- Ride Start and End Time Day vs. Bike Id (2 Plots)
- Ride Duration in Seconds, Minutes, and Hours vs. Bike Id (3 Plots)
- Ride Duration in Seconds, Minutes, and Hours vs. Ride Distance (3 Plots)
- Ride Distance vs. Bike Id (1 Plot)
- There is a negative correlation between member age and ride durations: duration decreases when age increases.
- Ages 23-40 have higher ride durations compared to other ages.
- Ages share the same distribution in both start and end time hours.
- There is a negative correlation between member age and start and end time hour: start and end time hours decrease as ages increase.
- Most start and end time hours are 7-9 AM and 4-6 PM.
- Ages 18-40 have the higher number of rides, which start and end mostly at 7-9 AM and 4-6 PM.
- Similar to start and end time hours, ages share the same distribution in both start and end time days.
- There is a negative correlation between member age and start and end time day: as age increases, start and end time hour and day decrease.
- Although the number of rides decreases as age increases, rides for all ages start and end mostly on days 4-9, 12-17 and 19-28.
- Ages 18-40 have the higher number of rides on most start and end time days.
- 99.9% of distance(km) values are below 8km, and distance(km) ranges from 0.17km-8km.
- In general, there is a very weak positive correlation between distance(km) and member age.
- There is a negative correlation between member age and bike ids used: as age increases, the range of bike ids used decreases.
- There is a very weak positive correlation between start and end time hour and durations: as start and end time hours increase, durations increase.
- There is a very weak positive correlation between start and end time day and durations: as start and end time days increase, durations increase.
Insights:
- There is a very weak negative correlation between start and end time hour and distance(km): as start and end time hours increase, distance(km) decreases slightly.
Insights:
- There is a very weak positive correlation between start and end time day and distance(km): as start and end time days increase, distance(km) increases slightly.
- There is a positive correlation between start and end time hour and the range of bike ids used: as start and end time hours increase, the range of bike ids used increases from 4k upward, with the most used bikes ranging from 4.5k-5.3k.
- There is a positive correlation between start and end time day and the range of bike ids used: as start and end time days increase, the range of bike ids used increases from 4k upward, with the most used bikes ranging from 4.5k-5.3k.
- There is a weak negative correlation between duration in seconds, minutes, and hours and bike ids used: as duration increases, the number of bike ids decreases.
- Most bike ids are used for durations below 10k seconds, below 200 minutes, and below 5 hours.
- There is a negative correlation between duration in seconds, minutes, and hours and distance(km): as duration increases, distance(km) decreases.
- Most distances are traveled in a duration below 20k seconds, below 200 minutes, and below 5 hours.
- There is a very weak negative correlation between distance(km) and bike ids used: as distance increases, bike ids decrease.
- Most bike ids are used for distances between 0.2km and 8km.
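The direction and strength of the pairwise relationships described above can be quantified with a correlation matrix. The sketch below uses the Pearson coefficient on toy data that is deliberately perfectly linear; the column names are assumptions, and the source does not say which coefficient, if any, was computed.

```python
import pandas as pd


def correlation_matrix(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Pairwise Pearson correlation between the selected numeric columns."""
    return df[columns].corr(method="pearson")


# Toy data: duration falls and distance rises exactly linearly with age.
toy = pd.DataFrame(
    {
        "member_age": [20, 30, 40, 50],
        "duration_sec": [900, 800, 700, 600],
        "distance_km": [1.0, 1.5, 2.0, 2.5],
    }
)
corr = correlation_matrix(toy, ["member_age", "duration_sec", "distance_km"])
print(corr.round(2))
```

On heavily skewed variables such as ride duration, `method="spearman"` is a rank-based alternative that is less sensitive to a few very long rides.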
We can plot the following 58 categorical vs. numeric relationships for the major categorical variables of user type, member gender, bike share status, age groups, weekdays, and ride start and end stations, compared to the major numeric variables of member age, duration in seconds, minutes, and hours, ride start and end time hour and day, bike id, and ride distance.
Categorical vs. numeric variable pairs include:
- User Type vs. Member Age (1 Plot)
- User Type vs. Duration in Seconds, Minutes, and Hours (3 Plots)
- User Type vs. Ride Start and End Time Hour and Day (4 Plots)
- User Type vs. Bike Id and Ride Distance (2 Plots)
- Member Gender vs. Member Age (1 Plot)
- Member Gender vs. Duration in Seconds, Minutes, and Hours (3 Plots)
- Member Gender vs. Ride Start and End Time Hour and Day (4 Plots)
- Member Gender vs. Bike Id and Ride Distance (2 Plots)
- Bike Share Status vs. Member Age (1 Plot)
- Bike Share Status vs. Duration in Seconds, Minutes, and Hours (3 Plots)
- Bike Share Status vs. Ride Start and End Time Hour and Day (4 Plots)
- Bike Share Status vs. Bike Id and Ride Distance (2 Plots)
- Ride Stations vs. Member Age (1 Plot)
- Ride Stations vs. Duration in Seconds, Minutes, and Hours (3 Plots)
- Ride Stations vs. Ride Start and End Time Hour and Day (4 Plots)
- Ride Stations vs. Bike Id and Ride Distance (2 Plots)
- Weekday vs. Member Age (1 Plot)
- Weekday vs. Duration in Seconds, Minutes, and Hours (3 Plots)
- Weekday vs. Ride Start and End Time Hour and Day (4 Plots)
- Weekday vs. Bike Id and Ride Distance (2 Plots)
- Age Groups vs. Duration in Seconds, Minutes, and Hours (3 Plots)
- Age Groups vs. Ride Start and End Time Hour and Day (4 Plots)
- Age Groups vs. Bike Id and Ride Distance (2 Plots)
- The average age for both customers and subscribers is 34 years old.
- 25% of ages for both customers and subscribers are below 27 years old.
- The median age for both customers and subscribers is 32 years old.
- 75% of customer ages are below 38 years old, and 75% of subscriber ages are below 39 years old.
- The max customer age is 141 years old and the max subscriber age is 119 years old.
- Member age follows a similar distribution for each user type.
- In general, customer statistics closely match those of subscribers, except that the 75% statistic is higher for subscribers while the max is higher for customers.
- Average duration(second) for customer is 1,432 seconds and for subscriber is 640 seconds.
- Average duration(minute) for customer is 24 minutes and for subscriber is 11 minutes.
- Average duration(hour) for customer is 0.4 hour and for subscriber is 0.18 hour.
- In general, average durations for customers are higher than those of subscribers.
- Average start time (hour) for customer is 1.6 PM and for subscriber is 1.3 PM.
- Average end time (hour) for customer is 1.9 PM and for subscriber is 1.6 PM.
- Average start and end time(day) for customer is 16 and for subscriber is 15.
- Start and end time hour for customers is 10 AM and for subscribers is 9 AM.
- Start and end time hour for customers is 9 AM and for subscribers is 8 AM.
- In general, start and end time hour and day for subscribers are earlier than those of customers.
- Average distance(km) for customer is 1.9km and for subscriber is 1.7km.
- Average most used bike id for customer is 4,226 and for subscriber is 4,503.
- In general, the average distance(km) traveled by customers is higher than that of subscribers, while the average bike id used by customers is lower.
- Average age of female members is 33 years old.
- Average age of male members is 34 years old.
- Average age of other members is 36 years old.
- In general, the average age of other members is higher than those of male and female members, and the male average age is higher than the female one.
- Average duration (second) for female is 779 seconds, for male is 673 seconds, for not-defined is 1,189 seconds, and for other is 997 seconds.
- Average duration (minute) for female is 13 minutes, for male is 11 minutes, for not-defined is 20 minutes, and for other is 17 minutes.
- Average duration (hour) for female is 0.22 hour, for male is 0.19 hour, for not-defined is 0.33 hour, and for other is 0.28 hour.
- In general, average durations for not-defined and other genders are higher than those of male and female genders, and females ride longer than males.
- Average start time(hour) for female is 1.2 PM, for male is 1.5 PM, for not-defined is 1.5 PM, and for other is 1.7 PM.
- Average end time(hour) for female is 1.4 PM, for male is 1.7 PM, for not-defined is 1.7 PM, and for other is 1.8 PM.
- Average start and end time(day) for female is 15, for male is 15, for not-defined is 15, and for other is 15.
- In general, the average start and end day is the same for all genders, but the average start and end time hour for females is earlier than for males and all other genders.
- 25% of all genders' start time hours are below 9 AM and 75% are below 5 PM. 25% of all genders' end time hours are below 9 AM, except for other gender (10 AM), and 75% are below 5 PM, except for male gender (6 PM).
- 25% of all genders' start and end time days are below day 8 of the month and 75% of start time days are below day 22 of the month.
- Average distance(km) for female is 1.8km, for male is 1.7km, for not-defined is 1.7km, and for other is 1.8km.
- Average used bike id for female is 4,397, for male is 4,507, for not-defined is 4,275, and for other is 4,543.
- In general, the average distance(km) traveled by female and other genders is higher than that of male and not-defined genders. Average used bike ids range from 4,275 to 4,543.
- Average member age that share bikes during the ride is 34 years old.
- Average member age that doesn't share bikes during the ride is 32 years old.
- Average duration(second) for non-shared bike rides is 730 seconds and for shared rides is 684 seconds.
- Average duration(minute) for non-shared bike rides is 12 minutes and for shared rides is 11 minutes.
- Average duration(hour) for non-shared bike rides is 0.2 hours and for shared rides is 0.19 hour.
- In general, average durations for non-shared bike rides are higher than those of shared bike rides.
- Average start time(hour) for non-shared bike rides is 1.4 PM and for shared rides is 2.1 PM.
- Average end time (hour) for non-shared bike rides is 1.5 PM and for shared rides is 2.2 PM.
- Average start and end time(day) for both non-shared and shared bike rides is 3.3 PM.
- In general, start and end time (hour) for non-shared bike rides are earlier than those of shared bike rides, while the average start and end time (day) is the same (3.3 PM).
- Average distance(km) for non-shared bike rides is 1.7km and for shared bike rides is 1.3km. The average distance of shared rides is lower than that of non-shared rides.
- Average used bike id for non-shared bike rides is 4483 and for shared bike rides is 4379.
- Average member age for San Francisco's stations is 33 years old, for Oakland_Berkeley's stations is 34 years old, and for San Jose's stations is 31 years old.
- Average duration (second) for San Francisco's stations is 812.6 seconds, for Oakland_Berkeley's stations is 747.5 seconds, and for San Jose's stations is 752.6 seconds.
- Average duration (minute) for San Francisco's stations is 13.5 minutes, for Oakland_Berkeley's stations is 12.5 minutes, and for San Jose's stations is 12.5 minutes.
- Average duration (hour) for San Francisco's stations is 0.23 hour, for Oakland_Berkeley's stations is 0.21 hour, and for San Jose's stations is 0.21 hour. In general, San Francisco's stations have higher durations than both Oakland_Berkeley's and San Jose's stations.
- Average start time(hour) for San Francisco's stations is 1.1 PM, for Oakland_Berkeley's stations is 1.05 PM, and for San Jose's stations is 2.2 PM.
- Average start time(day) for San Francisco's stations is 16, for Oakland_Berkeley's stations is 15, and for San Jose's stations is 16. In general, Oakland_Berkeley's stations have an earlier start time hour and day than San Francisco's or San Jose's stations.
- Average distance(km) for San Francisco's stations is 1.9km, for Oakland_Berkeley's stations is 1.7km, and for San Jose's stations is 1.7km.
- Average bike id used for San Francisco's stations is 4651, for Oakland_Berkeley's stations is 4104, and for San Jose's stations is 3764.
- In general, San Francisco's stations have higher distances than Oakland_Berkeley's or San Jose's stations.
- The average age for both start and end time weekdays is 34 years old, except for Saturday and Sunday (33 years old) and Thursday (35 years old).
- Average durations for weekend days (Saturday and Sunday) are higher than for working days.
- Average duration(second) for working days (Monday-Friday) ranges from 663 to 713 seconds, while weekend days average between 903 and 920 seconds.
- Average duration(minute) for working days (Monday-Friday) ranges from 11.1 to 11.9 minutes, while weekend days average between 15 and 15.3 minutes.
- Average duration(hour) for working days (Monday-Friday) ranges from 0.18 to 0.2 hour, while weekend days average between 0.25 and 0.26 hour.
- Average start time (hour) for working days (Monday-Friday) ranges from 12.8 PM to 1.7 PM, while weekend days average between 1.7 PM and 2.2 PM.
- Average end time (hour) for working days (Monday-Friday) ranges from 1 PM to 1.9 PM, while weekend days average between 1.8 PM and 2.4 PM.
- Average start and end time (day) for working days (Monday-Friday) ranges from 13-17, while weekend days average between 14-15.
- In general, working days' start and end time hours and days are earlier than those of weekend days.
- Average distance(km) for start and end time working days (Monday-Friday) ranges from 1.67km to 1.73km, while weekend days have an average of 1.6km.
- Average bike id most used for start and end time working days (Monday-Friday) ranges from 4380 to 4518, while weekend days average between 4575 and 4628.
- In general, the average distance(km) on working weekdays is higher than on weekend days, and the range of used bike ids on weekend days is higher than that of working days.
- Average duration for age group 60-70 is the highest (0.21 hour), followed by group 50-60 (0.204 hour), group 18-20 (0.201 hour), and groups 20-50 (0.191-0.198 hour). The lowest average duration is for group 70-141 (0.174 hour).
- Average start time(hour) for age groups 60-70 & 70-141 is the earliest (12.82 PM-12.87 PM), followed by groups 20-30 to 50-60 (1 PM-1.7 PM). The latest average start time(hour) is for group 0-20 (2.4 PM).
- Average end time(hour) for age groups 60-70 & 70-141 is the earliest (1 PM-1.06 PM), followed by groups 20-30 to 50-60 (1.2 PM-1.9 PM). The latest average end time(hour) is for group 0-20 (2.5 PM).
- Average start and end time(day) for age group 0-20 is the earliest (day 14), followed by groups 20-30 to 60-70 (day 15). The latest average start and end time(day) is for group 70-141 (day 16).
- Average distance(km) for age group 30-40 is the highest (1.8km), followed by age groups 20-30 and 40-70 (1.6km-1.7km). The lowest average distance(km) is for age groups 0-20 & 70-141 (1.3km-1.5km).
- The highest average bike id is for age groups 20-40 (4517-4520), followed by age groups 0-20, 40-50, 50-60 & 70-141 (4331-4409). The lowest average is for age group 60-70 (4090).
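The age-group averages above can be reproduced by binning member age and aggregating per bin. The following is a minimal sketch, not the notebook's actual code: the column names `member_age` and `duration_sec` and the sample values are assumptions made for illustration, and the bin edges simply follow the age groups named in the list.

```python
import pandas as pd

# Hypothetical sample rides; column names and values are assumptions.
rides = pd.DataFrame({
    "member_age": [19, 25, 34, 47, 58, 65, 72],
    "duration_sec": [720, 690, 700, 705, 735, 760, 625],
})

# Bin edges follow the age groups named in the text (0-20, 20-30, ..., 70-141).
edges = [0, 20, 30, 40, 50, 60, 70, 141]
labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]

# pd.cut uses right-closed intervals by default, e.g. (20, 30].
rides["age_group"] = pd.cut(rides["member_age"], bins=edges, labels=labels)

# Convert seconds to hours, then average per age group.
rides["duration_hour"] = rides["duration_sec"] / 3600
avg_duration_by_age = (
    rides.groupby("age_group", observed=True)["duration_hour"].mean().round(3)
)
print(avg_duration_by_age)
```

The same pattern applies to distance(km), start time, or bike id by swapping the aggregated column.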
Main features of interest include ride durations, ride times, ride weekdays, ride distance(km), ride stations/city, user types, member genders, member age, and age groups. I will investigate the relationships among these variables as follows:
Other additional features include ride bike share and bike id and how they are associated with the main features. We can investigate them as follows:
- In this section, I will investigate and plot multiple variables to explore the patterns, trends, models, and relationships among three or more features. The main thing I want to explore in this part of the analysis is how these variables of interest correlate and impact one another.
- User characteristics: User Type, Member Gender, and Age Groups.
- Ride Start and End Times: Start and End Weekdays.
- Ride Stations: Start and End Stations
- Bike Share: Bike Share Status
- Member Age
- Ride Durations: Durations in Seconds, Minutes, and Hours
- Ride Start and End Times: Ride Start and End Times in Hour and Day
- Ride Distance(km)
- Bike Id
1. Correlation of Durations vs. Other Numeric Variables:
- Durations vs. Start Time(hour): Very weak positive Correlation
- Durations vs. End Time(hour): Neutral Correlation
- Durations vs. Start Time(day): Very weak positive Correlation
- Durations vs. End Time(day): Very weak positive Correlation
- Durations vs. Member Age: Very weak negative Correlation
- Durations vs. Distance(km): Very weak positive Correlation
- Durations vs. Bike Id: Very weak negative Correlation
2. Correlation of Start & End Time(hour) vs. Other Numeric Variables:
- Start & End Time(hour) vs. Durations: Very weak positive Correlation
- Start Time(hour) vs. End Time(hour): Strong positive Correlation
- Start & End Time(hour) vs. Start and End Time(day): Very weak positive Correlation
- Start & End Time(hour) vs. Member Age: Very weak negative Correlation
- Start & End Time(hour) vs. Distance(km): Very weak negative Correlation
- Start & End Time(hour) vs. Bike Id: Very weak positive Correlation
3. Correlation of Start & End Time(day) vs. Other Numeric Variables:
- Start & End Time(day) vs. Durations: Very weak positive Correlation
- Start & End Time(day) vs. Start & End Time(hour): Very weak positive Correlation
- Start & End Time(day) vs. Member Age: Neutral Correlation
- Start & End Time(day) vs. Distance(km): Very weak positive Correlation
- Start & End Time(day) vs. Bike Id: Very weak positive Correlation
4. Correlation of Member Age vs. Other Numeric Variables:
- Member Age vs. Durations: Very weak negative Correlation
- Member Age vs. Start & End Time(hour): Very weak negative Correlation
- Member Age vs. Start & End Time(day): Neutral Correlation
- Member Age vs. Distance(km): Very weak positive Correlation
- Member Age vs. Bike Id: Very weak negative Correlation
5. Correlation of Distance(km) vs. Other Numeric Variables:
- Distance(km) vs. Durations: Very weak positive Correlation
- Distance(km) vs. Start & End Time(hour): Very weak negative Correlation
- Distance(km) vs. Start & End Time(day): Very weak positive Correlation
- Distance(km) vs. Member Age: Very weak positive Correlation
- Distance(km) vs. Bike Id: Very weak positive Correlation
6. Correlation of Bike Id vs. Other Numeric Variables:
- Bike Id vs. Durations: Very weak negative Correlation
- Bike Id vs. Start & End Time(hour): Very weak positive Correlation
- Bike Id vs. Start & End Time(day): Very weak positive Correlation
- Bike Id vs. Member Age: Very weak negative Correlation
- Bike Id vs. Distance(km): Very weak positive Correlation
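Pairwise labels like those listed above are typically read off a Pearson correlation matrix. The sketch below shows one way to compute such a matrix with pandas and turn each coefficient into a verbal label. The column names, the sample values, and the thresholds inside `describe` are all assumptions chosen for illustration; the original analysis may have used different cut-offs.

```python
import pandas as pd

# Hypothetical sample rides; column names and values are assumptions.
rides = pd.DataFrame({
    "duration_sec": [600, 720, 540, 900, 660, 780],
    "start_hour": [8, 9, 17, 14, 18, 12],
    "end_hour": [8, 9, 17, 14, 18, 12],
    "member_age": [34, 28, 45, 31, 52, 39],
    "distance_km": [1.5, 1.9, 1.2, 2.4, 1.6, 2.0],
})

# Pearson correlation between every pair of numeric columns.
corr = rides.corr(method="pearson")


def describe(r: float) -> str:
    """Map a coefficient to a rough verbal label (illustrative thresholds)."""
    strength = abs(r)
    if strength < 0.05:
        return "neutral"
    if strength < 0.2:
        label = "very weak"
    elif strength < 0.4:
        label = "weak"
    elif strength < 0.6:
        label = "moderate"
    else:
        label = "strong"
    sign = "positive" if r > 0 else "negative"
    return f"{label} {sign}"


# Print a label for each distinct pair of variables.
cols = list(corr.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        print(f"{a} vs. {b}: {describe(corr.loc[a, b])}")
```

Because the matrix is symmetric, each pair only needs to be reported once, which would also shorten the six lists above.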
- There is a negative correlation between member age and distance(km) based on duration: as member age increases, the distance over durations in seconds, minutes, and hours decreases.
Avg. member age per user type and member gender can be summarized as follows:
Avg. distance(km) per user type and member gender can be summarized as follows:
Avg. Durations per user type and member gender can be summarized as follows:
Avg. start time hour and day per user type and member gender can be summarized as follows:
Avg. Bike Id per user type and member gender can be summarized as follows:
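Summaries of this kind (an average per user type and member gender) can be built with a grouped aggregation. This is a minimal sketch under assumed column names (`user_type`, `member_gender`, `member_age`, `duration_sec`, `distance_km`) with made-up sample values, not the notebook's original code.

```python
import pandas as pd

# Hypothetical sample rides; column names and values are assumptions.
rides = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Customer",
                  "Customer", "Subscriber", "Customer"],
    "member_gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "member_age": [34, 33, 29, 27, 36, 29],
    "duration_sec": [600, 660, 1300, 1500, 640, 1700],
    "distance_km": [1.6, 1.8, 1.9, 2.0, 1.7, 2.1],
})

# Average of each numeric feature per (user type, gender) combination.
summary = (
    rides.groupby(["user_type", "member_gender"])[
        ["member_age", "duration_sec", "distance_km"]
    ]
    .mean()
)

# Derived duration units, matching the seconds/minutes/hours views in the text.
summary["duration_min"] = summary["duration_sec"] / 60
summary["duration_hour"] = summary["duration_sec"] / 3600
print(summary.round(2))
```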
I extended my investigation of categorical variables including user type, member gender, ride weekday, age groups, ride stations/city, and bike share against numeric variables including ride durations, ride times, member age, distance(km) and bike id. My goal is to explore how these factors impact the usage of GoBike service and how they correlate to one another. So, let's explore multiple variable relationships as follows:
Some interesting or surprising interactions may include:
In conclusion, the Ford GoBike service dataset for the month of February consists of 183,412 rides. Subscribers account for most of them, at 89% of total rides, while customers make up 11%. Findings can be summarized as follows:
Customers have a higher duration(minute) for all genders (20.9-34.4 minutes) than subscribers (10.3-15.2 minutes), and a higher duration(hour) for all genders (0.3-0.6 hour) than subscribers (0.2-0.3 hour). Females' duration(hour) (0.22 hour) is higher than that of males (0.19 hour).
Avg. start time(hour) for working days (12.8-1.7 PM) is earlier than that of weekend days (1.7-2.3 PM). Avg. start time(hour) for all genders and user types is 1 PM-2 PM. Avg. start time(hour) for subscribers (1.4 PM) is earlier than that of customers (1.6 PM), and avg. start time(hour) for females (1.2 PM) is earlier than that of males (1.52 PM).
Avg. age for subscribers (33 years) is higher than that of customers (28 years), and avg. age for males (34 years) is higher than that of females (33 years).
Avg. distance(km) for customers (1.9km) is higher than that of subscribers (1.7km), and avg. distance(km) for females (1.77km) is higher than that of males (1.66km). Avg. distance(km) for start & end working days (1.7-1.72km) is higher than that of weekend days (1.6km).
Average duration(minute) for working days (Monday-Friday) ranges from 11.1 to 11.9 minutes, while weekend days average between 15 and 15.3 minutes. Average duration(hour) for working days ranges from 0.18 to 0.2 hour, while weekend days average between 0.25 and 0.26 hour. Average start time(hour) for working days ranges from 12.8 PM to 1.7 PM, while weekend days average between 1.7 PM and 2.2 PM. Average distance(km) for start and end time working days ranges from 1.67km to 1.73km, while weekend days average 1.6km.
Avg. duration(minute) for non-shared bikes during the ride (12 minutes) is higher than that of shared bikes (11 minutes). Avg. distance(km) for non-shared bikes is 1.7km and for shared bikes is 1.3km.
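The working-day versus weekend comparison can be sketched as follows. The column names (`start_time`, `duration_sec`) and the sample timestamps are assumptions for illustration only. The helper `decimal_hour_to_clock` is a hypothetical convenience for reading averages such as 12.8 as clock times.

```python
import pandas as pd

# Hypothetical sample rides; dates are arbitrary and column names are assumptions.
rides = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2019-02-04 08:30:00",  # Monday
        "2019-02-06 17:45:00",  # Wednesday
        "2019-02-09 14:15:00",  # Saturday
        "2019-02-10 13:00:00",  # Sunday
    ]),
    "duration_sec": [660, 720, 900, 930],
})

# dayofweek: Monday=0 ... Sunday=6, so values 5 and 6 are weekend days.
rides["day_type"] = rides["start_time"].dt.dayofweek.map(
    lambda d: "weekend" if d >= 5 else "working"
)
rides["duration_min"] = rides["duration_sec"] / 60
# Start time as a decimal hour on a 24-hour scale (e.g. 13.7 means 13:42).
rides["start_hour"] = (
    rides["start_time"].dt.hour + rides["start_time"].dt.minute / 60
)

by_day_type = rides.groupby("day_type")[["duration_min", "start_hour"]].mean()
print(by_day_type)


def decimal_hour_to_clock(hour: float) -> str:
    """Convert a decimal hour (24-hour scale) to an HH:MM string."""
    whole = int(hour)
    minutes = round((hour - whole) * 60)
    if minutes == 60:  # guard against rounding up to a full hour
        whole, minutes = whole + 1, 0
    return f"{whole:02d}:{minutes:02d}"
```

For example, an average start time written as 12.8 corresponds to 12:48, and 1.7 PM (13.7 on a 24-hour scale) corresponds to 13:42.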
Avg. duration(minute) for age group 60-70 (12.73 minutes) is higher than that of the other groups (10.4-12.26 minutes). The middle groups 18-60 average 11.5-12.27 minutes, and the lowest is age group 70-141 (10.4 minutes). Avg. distance(km) for age group 30-40 (1.8km) is higher than that of the other age groups (1.3km-1.7km). The middle age groups (20-30, 40-70 & 70-141) have an average distance of 1.5km-1.7km, and the lowest age group, 18-20, has an average distance of 1.3km.
San Francisco has 156 stations, Oakland_Berkeley has 127 stations, and San Jose has 47 stations. Avg. number of rides per station in San Francisco is , in Oakland_Berkeley is 1,480, and in San Jose is 295. Average member age for San Francisco's stations is 33 years old, for Oakland_Berkeley's stations is 34 years old, and for San Jose's stations is 31 years old. Average duration(hour) for San Francisco's stations is 0.23 hour, and for Oakland_Berkeley's and San Jose's stations is 0.21 hour. Average distance(km) for San Francisco's stations is 1.9km, and for Oakland_Berkeley's and San Jose's stations is 1.7km.
I have used the following sources in my exploration:
- Wikipedia and BikeShare.cc - Introduction about Ford GoBike.
- Github.io - Handling Missing Values
- Github.com - Haversine Formula - Calculate distance between latitude longitude pairs with Python
- Nbconvert - Exclude input cell code with TemplateExporter
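The distance(km) feature reported throughout is derived from station coordinates, and the sources above point to the Haversine formula for that step. For reference, a minimal sketch of the formula is given below. The function name `haversine_km` and the choice of 6371 km as the mean Earth radius are assumptions for illustration and may differ from the code actually used.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Example: one degree of latitude along a meridian.
print(round(haversine_km(0.0, 0.0, 1.0, 0.0), 2))
```

Note that this is the straight-line distance between the start and end stations, not the route actually ridden, so it is a lower bound on the real trip length.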